In this blog post, we are going to talk about a few model-agnostic tools for measuring variable importance.
Everyone loves tree-based models. Gradient boosting, random forests, and friends are wonderful, flexible tools. Another benefit of these models, because of their tree-ness, is that we can actually see how “important” each variable is in the decisions the model makes. This is also one of the many reasons we love linear models: we can see and quantify the strength of a feature in our model. But why should we limit ourselves to just linear and tree-based models?
Let's try to think of a new approach to get variable importance. When I was first really getting into ML, I remember asking one of my professors: “How much time do you spend on feature engineering?”. I will never forget his answer: “Feature engineering [is] the most crucial part to improve both accuracy and model generation. If you have an unnecessary feature in the model, you are in essence fitting noise.”. This has stuck with me for a long time, and it is a useful thing to keep in mind while discussing permutation importance.
If unnecessary features just provide noise, decreasing model accuracy and generalization, what happens if we replace a good feature with noise? Our model should get inherently worse, no? This leads us to the key question behind permutation importance:
If I replace a feature with noise, how much worse does the model perform?
This is the key idea of permutation-based variable importance. To calculate it, all we have to do is four simple steps:
Calculate prediction loss
Replace a feature with noise
Recalculate prediction loss
Compare
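Before formalizing anything, the four steps above can be sketched in a few lines of Python. This is a minimal sketch, assuming a fitted scikit-learn-style `model` and a held-out test set; the function name and use of mean squared error are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def single_feature_importance(model, X_test, y_test, col):
    # 1. calculate the baseline prediction loss
    baseline = mean_squared_error(y_test, model.predict(X_test))
    # 2. replace the feature with "noise" by shuffling it
    X_perm = X_test.copy()
    X_perm[col] = np.random.permutation(X_perm[col].values)
    # 3. recalculate the prediction loss on the permuted data
    permuted = mean_squared_error(y_test, model.predict(X_perm))
    # 4. compare: a ratio well above 1 means the feature mattered
    return permuted / baseline
```

A ratio near 1 suggests the model barely used the feature; a large ratio suggests it leaned on it heavily.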
Let's formulate permutation importance mathematically now! First, let's define our data as the set \(x\), with \(m\) observations and \(n\) features. Next, let's consider two sets within \(x\): \(x_s\) and \(x_c\). \(x_s\) represents the feature(s) we are interested in, and \(x_c\) represents the complement of \(x_s\) (in English: everything else). Thus:
\[x = (x_s, x_c)\]
Let's first define the original loss, with the original features, as \(\mathcal{L}\):
\[ \mathcal{L} = \mathrm{loss}\left(x_s, x_c \right) \]
Next, we need to replace \(x_s\) with noise. To do that, we want to sample the marginal distribution of \(x_s\), that is, the distribution of \(x_s\) independent of the other features. With a reasonably sized dataset, we can just permute \(x_s\) for more or less the same result. We will denote the permutation of \(x_s\) as \(x_s^*\). Next, let's define the loss of the permuted feature, \(\mathcal{L}^*\): \[ \mathcal{L}^* = \mathrm{loss}\left(x_s^*, x_c \right) \]
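A quick sanity check on the claim that a permutation stands in for sampling the marginal distribution: shuffling a column keeps exactly the same set of values, so the marginal distribution is unchanged; only the pairing with the other features is destroyed. A small sketch (pure numpy, variable names illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
x_s = rng.normal(size=1000)
x_s_perm = rng.permutation(x_s)

# the marginal distribution is identical: same values, just reordered
assert np.allclose(np.sort(x_s), np.sort(x_s_perm))
# but the association with row order (a stand-in for the other features) is broken
print(np.corrcoef(x_s, x_s_perm)[0, 1])  # near zero
```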
Finally, we can calculate the variable importance of \(x_s\):
\[ VIP_{\mathrm{perm}}(x_s) = \frac{\mathcal{L}^*}{\mathcal{L}} \]
There we go! It's that simple! An addendum to this, suggested by Jeremy Howard of fast.ai: add a feature of pure noise and see how important it is, for reference.
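That addendum can be sketched as a tiny helper: append a column of pure noise before fitting, and then any feature whose importance lands near the noise column's importance is probably not doing much. This is just an illustrative sketch, not code from the post (the helper name is made up):

```python
import numpy as np

def add_noise_baseline(X, rng=None):
    # append a pure-noise column to use as a reference feature
    rng = np.random.default_rng(rng)
    X_out = X.copy()
    X_out["baseline"] = rng.normal(size=len(X_out))
    return X_out
```

We will use exactly this idea in the `permutation_importance` implementation below.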
For our data, we will use the dataset discussed in Benjamin Tayo's amazing blog post. We use this dataset because its highly correlated features present a real challenge, which will expose some of the pitfalls of our techniques.
The goal with this dataset is to predict the number of crew members a cruise ship will need, given some parameters describing the ship. I believe the independent variables are fairly self-explanatory. First, let's read the data into Python and do a train-test split:
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston, fetch_california_housing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error as loss_mse
import math
import statistics as stats
import matplotlib.cm as cm
from pprint import pprint
# so this is readable:
import warnings
from sklearn.exceptions import DataConversionWarning, ConvergenceWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=ConvergenceWarning)
cruise = pd.read_csv("https://github.com/bot13956/ML_Model_for_Predicting_Ships_Crew_Size/raw/master/cruise_ship_info.csv")
X = cruise.loc[:, cruise.columns != "crew"]
X = X.loc[:, X.columns != "Ship_name"]
X = X.loc[:, X.columns != "Cruise_line"]
y = cruise.loc[:, cruise.columns == "crew"]
def split(df, p_train = 0.75, random_state = 0):
    train = df.sample(frac = p_train, random_state = random_state)
    test = df.drop(train.index)
    return(train, test)
(X_train, X_test), (y_train, y_test) = (split(x) for x in [X, y])
Next, let's use the amazing reticulate package to pass these exact data frames into R:
X <- py$X
y <- py$y
X_train <- py$X_train
X_test <- py$X_test
y_train <- py$y_train
y_test <- py$y_test
Now we are all set up to implement permutation importance in Python and in R.
(Click tabs below to change language!)
First, let's set up three models to test: a linear model, a k-nearest neighbors regressor, and a random forest:
lm = LinearRegression()
knn = KNeighborsRegressor(13)
rf = RandomForestRegressor(n_estimators = 100)
models = [lm, knn, rf]
for m in models:
    m.fit(X_train, y_train)
#> LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
#> KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
#> metric_params=None, n_jobs=None, n_neighbors=13, p=2,
#> weights='uniform')
#> RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
#> max_features='auto', max_leaf_nodes=None,
#> min_impurity_decrease=0.0, min_impurity_split=None,
#> min_samples_leaf=1, min_samples_split=2,
#> min_weight_fraction_leaf=0.0, n_estimators=100,
#> n_jobs=None, oob_score=False, random_state=None,
#> verbose=0, warm_start=False)
As a baseline truth, let's check which features are important in the random forest, and the strength of the predictors in our linear regression:
#> {'Age': 0.017940573693502024,
#> 'Tonnage': 0.41592715941929437,
#> 'cabins': 0.4194355080754462,
#> 'length': 0.07782246312009274,
#> 'passenger_density': 0.014576681398860572,
#> 'passengers': 0.0542976142928041}
#> {'Age': array([-0.01321228]),
#> 'Tonnage': array([0.00316145]),
#> 'cabins': array([0.79858261]),
#> 'length': array([0.39111394]),
#> 'passenger_density': array([0.01037499]),
#> 'passengers': array([-0.1045376])}
Looks like tonnage and cabins are the most important variables for the random forest, and cabins, length, and passengers for linear regression.
Next, let's create a function for permutation importance! This should be pretty easy to do; I have added in a few extra components for thoroughness, but the code is not hard:
def permutation_importance(model, x, y, loss, base = False, x_train = None, y_train = None, kind = "prop", n_rounds = 5):
    explan = x.columns
    baseline = loss(y, model.predict(x))
    res = {k: [] for k in explan}
    if (base is True):
        res["baseline"] = []
    for n in range(0, n_rounds):
        for i in range(0, len(explan)):
            col = explan[i]
            x_temp = x.copy()
            x_temp[col] = np.random.permutation(x_temp[col])
            if (kind != "prop"):
                res[col].append(loss(y, model.predict(x_temp)) - baseline)
            else:
                res[col].append(loss(y, model.predict(x_temp)) / baseline)
        if (base is True):
            x_temp = x.copy()
            x_train2 = x_train.copy()
            x_temp["baseline"] = np.clip(np.random.normal(size = len(x_temp)), -1., 1.)
            x_train2["baseline"] = np.clip(np.random.normal(size = len(x_train2)), -1., 1.)
            mod2 = type(model)()
            mod2.fit(x_train2, y_train)
            if (kind != "prop"):
                res["baseline"].append(loss(y, mod2.predict(x_temp)) - baseline)
            else:
                res["baseline"].append(loss(y, mod2.predict(x_temp)) / baseline)
    return(pd.DataFrame.from_dict(res))
Let's also define a helper function to do this for all our models, and then go ahead and calculate the importances!
# convert object name to string!
def get_name(obj):
    name = [x for x in globals() if globals()[x] is obj][0]
    return(name)
imps = {}
for m in models:
    imps[get_name(m)] = permutation_importance(m, X_test, y_test, loss_mse, True, X_train, y_train, n_rounds = 5)
pprint(imps)
#> {'knn':         Age    Tonnage  passengers  ...    cabins  passenger_density  baseline
#> 0 0.953594 10.460554 1.363738 ... 0.977356 1.080228 1.310102
#> 1 0.913625 20.251634 1.070447 ... 1.059398 1.287608 1.318981
#> 2 1.029486 19.060316 1.190654 ... 0.934822 1.397907 1.312979
#> 3 1.054564 24.971323 1.262756 ... 0.979389 1.595057 1.306668
#> 4 0.780959 17.193917 1.113051 ... 1.014168 1.219071 1.309752
#>
#> [5 rows x 7 columns],
#> 'lm': Age Tonnage passengers ... cabins passenger_density baseline
#> 0 0.943507 1.157280 7.880352 ... 72.529664 1.148155 1.042532
#> 1 0.953535 0.956423 7.847115 ... 75.157145 1.049876 1.000712
#> 2 1.151137 1.027036 7.383386 ... 69.412267 0.942257 0.992352
#> 3 1.079984 0.969374 7.203343 ... 58.536218 1.070731 1.072106
#> 4 1.013006 1.057246 9.071541 ... 80.655969 1.154589 1.035966
#>
#> [5 rows x 7 columns],
#> 'rf': Age Tonnage passengers ... cabins passenger_density baseline
#> 0 1.391592 14.571149 2.146399 ... 21.261009 1.461168 1.619377
#> 1 1.132233 15.763448 2.168587 ... 20.955786 1.511130 1.610004
#> 2 1.385285 11.477794 1.971031 ... 13.162853 1.806996 1.566385
#> 3 1.359806 20.469226 2.993920 ... 21.803024 1.564689 1.655168
#> 4 1.311191 13.343430 1.964030 ... 21.434352 1.573079 1.972894
#>
#> [5 rows x 7 columns]}
This output is a bit hard to read, so let's go ahead and write a helper function which averages the importances and plots them nicely!
plt.style.use("ggplot")
def plot_perm_imp(df, ax = None, color = 'blue'):
    # mean the columns and put it back into a data frame
    df1 = (df.apply(stats.mean, 0, result_type = "broadcast")).drop(df.index[1:])
    # create a new data frame excluding baseline, so we can do something special with it
    df_temp = df1.loc[:, df.columns != 'baseline']
    # melt and sort it for plotting
    df2 = df_temp.melt(var_name = 'variable', value_name = 'importance')
    df2 = df2.sort_values(by = "importance")
    # plot it nicely
    df2.plot(kind = 'barh', x = 'variable', y = 'importance', width = 0.8, ax = ax, color = color)
    # draw a bar and an arrow to baseline
    for n in df1.columns:
        if n == "baseline":
            plt.axvline(x = df1[n][0])
            plt.annotate('baseline',
                         xy = (df1[n][0], 1),
                         xytext = (df1[n][0] + 0.4, 3),
                         arrowprops = dict(facecolor = 'black', shrink = 0.05),
                         bbox = dict(boxstyle = "square", fc = (1, 1, 1)))
# plot the importance dict!
fig = plt.figure(figsize=(10,10))
for i in range(len(imps.keys())):
    ax = fig.add_subplot(len(list(imps.keys())), 1, i+1)
    c = sns.color_palette("hls", i+1)[i]
    plot_perm_imp(imps[list(imps.keys())[i]], ax = ax, color = c)
    ax.set_title(list(imps.keys())[i])
    for tick in ax.yaxis.get_major_ticks():
        tick.label.set_fontsize("x-small")
        tick.label.set_rotation(45)
plt.show()
Permutation importances for linear regression, knn, and a random forest
Looks like it worked alright! For the random forest especially, we nicely identified the two most important features, though with little granularity. With our linear model it is a little less clear, but we do see that the same three most important features carry through! Still, we have no real granularity and no sense of the scale of a variable's effect. Permutation importance also carries a general risk with correlated variables: the permuted observations are often unrealistic. For example, our permutations may produce tiny boats (in length) which weigh as much as the largest boats (they would sink!). This often makes the output of permutation-based tools unreliable, something we will explore further in this post.
library(randomForest)
library(kknn)
# set up models with parameters
rf <- function(df) {
  return(randomForest(crew ~ ., data = df, ntree = 100))
}
# knn in R is annoying, so we will need to build a consistent API ourselves:
knn <- function(df) {
  res <- list()
  res$train <- df
  res$k <- 13
  res <- structure(res, class = "knn")
  return(res)
}
predict.knn <- function(obj, newdata) {
  out <- kknn(crew ~ ., train = obj$train, test = newdata, k = obj$k)
  return(as.numeric(out$fitted.values))
}
linear_model <- function(df) {
  lm(crew ~ ., data = df)
}
models <- list("lm" = linear_model, "knn" = knn, "rf" = rf)
t_train <- cbind(X_train, y_train)
trained_models <- lapply(models, function(f) f(t_train))
As a baseline truth, let's check which features are important in the random forest, and the strength of the predictors in our linear regression:
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -0.9349 | 1.878 | -0.4977 | 0.6197 |
| Age | -0.01321 | 0.02083 | -0.6343 | 0.5272 |
| Tonnage | 0.003161 | 0.01692 | 0.1868 | 0.8522 |
| passengers | -0.1045 | 0.06831 | -1.53 | 0.1288 |
| length | 0.3911 | 0.1624 | 2.409 | 0.01765 |
| cabins | 0.7986 | 0.1058 | 7.545 | 1.34e-11 |
| passenger_density | 0.01037 | 0.02618 | 0.3963 | 0.6927 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 118 | 1.089 | 0.906 | 0.9009 |
| | IncNodePurity |
|---|---|
| Age | 65.06 |
| Tonnage | 371.6 |
| passengers | 196.5 |
| length | 280.3 |
| cabins | 414.8 |
| passenger_density | 25.14 |
Looks like tonnage and cabins are the most important variables for the random forest, and cabins, length, and passengers for linear regression.
Next, let's create a function for permutation importance! This should be pretty easy to do; I have added in a few extra components for thoroughness, but the code is not hard:
# loss function
loss_mse <- function(truth, preds) {
  error <- truth - preds
  square_error <- error^2
  return(mean(square_error))
}
# get baseline loss
get_loss <- function(model, x, y, loss) {
  loss(y, predict(model, x))
}
# permute a single column
permute_column <- function(df, col) {
  df[[col]] <- df[[col]][sample(1:nrow(df))]
  return(df)
}
permutation_importance <- function(model, x, y, loss, x_train = NA, y_train = NA, n_rounds = 5) {
  baseline <- get_loss(model, x, y[[1]], loss)
  explan <- names(x)
  # permute every column once and recompute the loss
  single_round_imp <- function() {
    dfs <- lapply(explan, function(i) permute_column(x, i))
    return(vapply(dfs, function(i) get_loss(model, i, y[[1]], loss), numeric(1)))
  }
  res <- lapply(1:n_rounds, function(r) single_round_imp())
  res <- as.data.frame(do.call(rbind, res))
  res <- as.data.frame(lapply(res, function(x) x / baseline))
  names(res) <- names(x)
  return(as.data.frame(lapply(res, mean)))
}
# we can do this in a second because R is speedy
perm_vips <- lapply(trained_models, function(x) permutation_importance(x, X_test, y_test, loss_mse, n_rounds = 30))
pander::pander(perm_vips)
lm:
| Age | Tonnage | passengers | length | cabins | passenger_density |
|---|---|---|---|---|---|
| 1.037 | 1.053 | 7.129 | 4.361 | 68.85 | 1.086 |
knn:
| Age | Tonnage | passengers | length | cabins | passenger_density |
|---|---|---|---|---|---|
| 1.267 | 1.829 | 2.166 | 3.107 | 2.692 | 1.285 |
rf:
| Age | Tonnage | passengers | length | cabins | passenger_density |
|---|---|---|---|---|---|
| 1.189 | 6.172 | 3.086 | 4.436 | 10.42 | 1.406 |
So, it looks like our importances are quite similar to the importances calculated explicitly by the models. Not a bad job! Let's go ahead and produce a nice plot!
library(tidyverse)
vip_flat <- lapply(perm_vips, gather)
# Finally, an appropriate use of superassignment!!
lapply(names(vip_flat), function(x) {
  vip_flat[[x]][["model"]] <<- x
}) %>% invisible
do.call(rbind, vip_flat) %>% as.data.frame %>%
  ggplot(aes(x = key, y = value, fill = model)) + geom_bar(stat = "identity") +
  facet_grid(model ~ ., scales = "free") + coord_flip() + ggthemes::theme_fivethirtyeight() +
  ggthemes::scale_fill_fivethirtyeight() + ggtitle("Permutation Importance")
Looks like it worked alright! For the random forest especially, we nicely identified the two most important features, though with little granularity. With our linear model it is a little less clear, but we do see that the same three most important features carry through! Still, we have no real granularity and no sense of the scale of a variable's effect. Permutation importance also carries a general risk with correlated variables: the permuted observations are often unrealistic. For example, our permutations may produce tiny boats (in length) which weigh as much as the largest boats (they would sink!). This often makes the output of permutation-based tools unreliable, something we will explore further in this post.
Let’s try to think of another way to quantify the effect of a variable. We previously answered the question: “How much worse will my prediction be, given a feature has been replaced with noise?”. Now, let's ask a new question:

> What will my prediction be, in general, when a feature is at a specific value?
This is Partial Dependence in a nutshell. Stemming from this, we can make the claim:
If a feature is more important, the prediction will vary more with that feature than others.
This is exactly the claim made (and supported) in this excellent paper. Thus, in this section, we will calculate two things: Partial Dependence and Partial Dependence Importance. Partial Dependence, at a high level, is the model's average prediction as a function of the feature of interest, with everything else held at its observed values. We can then take, more or less, the standard deviation of these curves to estimate importance (it's a bit different for categorical variables; see the paper).
Consider again \(x = (x_s, x_c)\)
We can define the partial dependence as the expected value of our prediction function, \(\hat f\), over the distribution of \(x_c\):
\[\mathrm{PDP}(x_s) = E_{x_c} \left[ \hat f (x_s, x_c) \right]\]
\[ = \int \hat f (x_s, x_c) \, \mathbb{P}_c(x_c) \, dx_c\] where \(\mathbb{P}_c\) is the marginal distribution of \(x_c\): \(\mathbb{P}_c(x_c) = \int p(x) \, dx_s\).
We can then approximate this over a finite sample of \(k\) observations of \(x_c\) using the Monte Carlo method:
\[\widetilde {\mathrm{PDP}} (x_s) = \frac{1}{k} \sum_{i=1}^{k} \hat f (x_s, x_c^{(i)})\]
where \(x_c^{(i)}\) is a distinct observation (row) of \(x_c\).
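The Monte Carlo estimate translates almost directly into code: fix \(x_s\) at a given value for every row, predict, and average over the rows. A minimal sketch, assuming a fitted scikit-learn-style `model` (the function name is illustrative):

```python
import numpy as np

def pdp_point(model, X, var, value):
    # fix the feature of interest at `value` for every observation (every x_c^(i))...
    X_fixed = X.copy()
    X_fixed[var] = value
    # ...then average the predictions over the k observed rows
    return float(np.mean(model.predict(X_fixed)))
```

Sweeping `value` over a grid of the feature's observed values traces out the whole partial dependence curve, which is exactly what the functions below do.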
First, let's define a function to calculate the partial dependence of a prediction on a single variable:
def pdp_var(model, x, var):
    explan = sorted(x[var])
    preds = []
    for i in range(0, len(explan)):
        X_tmp = x.copy()
        # fix the variable at a single value for every row
        val = np.asarray(explan)[i]
        X_tmp[var] = val
        preds.append(model.predict(X_tmp))
    # row i holds the predictions with var fixed at the ith value,
    # so average across observations (axis 1), not across values
    preds = np.asarray(preds).reshape(len(explan), len(x))
    pv = preds.mean(axis = 1)
    return(explan, pv)
Next, let's scale that up to work over an entire data frame!
def pdp_df(model, x):
    # return a dict of data frames
    res = {}
    for c in x.columns:
        d = pd.DataFrame()
        d["Value"], d["Average Prediction"] = pdp_var(model, x, c)
        res[c] = d
    return(res)
Let's now show the partial dependence for just our random forest!
rf_pdp = pdp_df(rf, X_test)
fig = plt.figure(figsize = (10, 10))
for k in range(0, len(rf_pdp.keys())):
    ax = fig.add_subplot(2, 3, k+1)
    c = sns.color_palette("hls", k+1)[k]
    df = rf_pdp[list(rf_pdp.keys())[k]]
    sns.lineplot(x = "Value", y = "Average Prediction", ax = ax, color = c, data = df)
    ax.set_title(list(rf_pdp.keys())[k])
plt.subplots_adjust(wspace = 0.5, hspace = 0.5)
plt.show()
Remember, we said that the variability in partial dependence is equivalent to the importance of a variable. Let's go ahead and write that up too!
def pdp_importance(model, x):
    pdpdf = pdp_df(model, x)
    v = {k: np.std(pdpdf[k]["Average Prediction"]) for k in pdpdf.keys()}
    return(v)
pdp_imps = {get_name(m): pdp_importance(m, X_test) for m in models}
def plot_pdp_imp(d, ax = None, color = "blue"):
    df = pd.DataFrame()
    df["Variable"] = d.keys()
    df["Importance"] = d.values()
    df.sort_values("Importance").plot(kind = "barh", x = "Variable", y = "Importance", width = 0.8, ax = ax, color = color)
fig = plt.figure(figsize = (10,10))
for i in range(len(pdp_imps.keys())):
    ax = fig.add_subplot(len(list(pdp_imps.keys())), 1, i+1)
    c = sns.color_palette("hls", i+1)[i]
    plot_pdp_imp(pdp_imps[list(pdp_imps.keys())[i]], ax = ax, color = c)
    ax.set_title(list(pdp_imps.keys())[i])
plt.show()
First, let's define a function to calculate the partial dependence of a prediction on a single variable:
pdp_var <- function(model, x, var){
  x_sorted <- x[order(x[[var]], decreasing = TRUE),]
  # predefine matrix!
  out <- matrix(nrow = nrow(x), ncol = nrow(x))
  for (i in 1:nrow(x)) {
    tmp_df <- x_sorted
    # fix the variable at a single value for every row
    tmp_df[[var]] <- tmp_df[[var]][i]
    out[i,] <- predict(model, tmp_df)
  }
  # row i holds the predictions with var fixed at the ith value,
  # so average across observations with rowMeans
  res <- rowMeans(out)
  return(data.frame("value" = x_sorted[[var]], "avg_pred" = res))
}
Next, let's scale that up over the entire data frame!
pdp_df <- function(model, x) {
  pdps <- lapply(colnames(x), function(n) {
    res <- pdp_var(model, x, n)
    res$variable <- n
    return(res)
  })
  return(pdps)
}
(rf_pdp <- do.call(rbind, pdp_df(trained_models[[3]], X_test)))
Finally, let's plot these PDPs!
library(ggthemes)
rf_pdp %>% ggplot(aes(color = variable, y = avg_pred, x = value)) +
  geom_line(size = 1.5) +
  facet_wrap(variable ~ ., scales = "free") +
  theme_fivethirtyeight() + scale_color_hc()
Remember, we said that the variability in partial dependence is equivalent to the importance of a variable. Let's go ahead and write that up too! We are going to use a lot of lapply, since we are dealing with a list of lists of data frames and trying to get it into a tidy format for ggplot quickly.
# return the pdp lists for all trained models
pdps <- lapply(trained_models, function(m) pdp_df(m, X_test))
# calculate the standard deviation of the average predictions
sd_pdp <- function(df) {
  imp <- as.numeric(sd(df$avg_pred))
  return(data.frame("importance" = imp, "variable" = df$variable[2]))
}
# we have a list of lists of data frames, so things are going to get a little weird
# We need to apply our function at depth of two, hence the double lapply
pdp_imps <- lapply(pdps, function(x) lapply(x, sd_pdp))
# next we need to clean up all the items of our lists are tidy data frames/matrices
pdp_imps <- lapply(pdp_imps, function(x) do.call(rbind, x))
# finally before we can plot, we add a character vector indicating model type
pdp_imps <- lapply(names(pdp_imps), function(x) {
  pdp_imps[[x]]$model <- x
  return(pdp_imps[[x]])
})
# then we turn it all into a nice data frame
pdp_imps <- do.call(rbind, pdp_imps)
pdp_imps %>% ggplot(aes(x = variable, y = importance, fill = model)) + geom_bar(stat = "identity") +
  facet_grid(model ~ ., scales = "free") + coord_flip() + ggthemes::theme_fivethirtyeight() +
  ggthemes::scale_fill_fivethirtyeight() + ggtitle("Partial Dependence Importance")
Something is off!!!